Skip to content

Cleanup Docker image handling, ECS service management, and model training resources#793

Merged
chrisaddy merged 9 commits intomasterfrom
docker-image-handling-updates
Mar 18, 2026
Merged

Cleanup Docker image handling, ECS service management, and model training resources#793
chrisaddy merged 9 commits intomasterfrom
docker-image-handling-updates

Conversation

@forstmeier
Copy link
Copy Markdown
Collaborator

@forstmeier forstmeier commented Mar 16, 2026

Overview

Changes

  • reorganize/consolidate Docker image handling
  • cleanup GitHub workflows ECS/data management
  • update AWS/Pulumi resource naming
  • consolidate TiDE model training resouces

Context

Some additional fixes given the new Prefect pattern.

Summary by CodeRabbit

  • Chores

    • Standardized service names with "-server" suffix and aligned image/repository naming.
    • Consolidated model-training components into server/worker roles and switched build/push semantics to package_name.
    • Runbook updated for automated environment secret population.
  • New Features

    • Added in-tool deploy command and a matrix-based deploy workflow for package+stage deployments.
  • Behavior Changes

    • Removed automated scheduled data-sync and switched teardown to manual-only runs.
    • Model training deployments limited to weekdays with a configured timezone.

Copilot AI review requested due to automatic review settings March 16, 2026 01:05
@github-project-automation github-project-automation Bot moved this to In Progress in Overview Mar 16, 2026
@github-actions github-actions Bot requested a review from chrisaddy March 16, 2026 01:05
@github-actions github-actions Bot added python Python code updates markdown Markdown code updates yaml YAML code updates labels Mar 16, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Consolidates CI build/push into an application+stage matrix with a separate deploy_images job, removes scheduled sync/teardown triggers, renames trainer/training → runner/model-trainer across infra/storage/compute/exports, refactors TiDE training into workflow.train_model, and updates Maskfile, Dockerfiles, tests, and docs.

Changes

Cohort / File(s) Summary
CI/CD workflows
​.github/workflows/launch_infrastructure.yaml, ​.github/workflows/teardown_infrastructure.yaml, ​.github/workflows/sync_data.yaml
Converted launch matrix from serviceapplication+stage, renamed build job to include application-stage, added deploy_images job (matrix deploy), removed scheduled triggers and deleted the separate sync_data workflow.
Mask & build tooling
maskfile.md, ​.github/workflows/launch_infrastructure.yaml
Switched build/push command signatures from application_namepackage_name, added commit-hash existence check/ECR login, caching adjustments, and a new deploy (package_name) (stage_name) command; updated service mapping to -server/-worker names.
Infrastructure exports & entrypoint
infrastructure/__main__.py
Renamed/re-exported ECR outputs and image URIs from trainer→runner and training→model-trainer-server-worker; updated training API host to model-trainer-server and removed legacy training_worker_image export.
Storage / ECR repositories
infrastructure/storage.py, libraries/python/tests/test_infrastructure_storage.py
Replaced tide-trainertide-runner and training-server/workermodel-trainer-server-worker repositories; adjusted lifecycle resource names and image URI variables; updated tests to assert new lifecycle names.
Compute resources & naming
infrastructure/compute.py
Mass renames and refactors: trainingmodel-trainer, many services/tasks gain -server suffix, updated target groups, service discovery, RDS/ElastiCache names, env vars, secrets, and dependency refs to match new model-trainer naming.
Models: TiDE training consolidation
models/tide/src/tide/workflow.py, models/tide/src/tide/trainer.py
Removed legacy trainer.py; introduced DEFAULT_CONFIGURATION and train_model in workflow.py (deferred imports) and updated call sites to use the new workflow entrypoint.
Models: scheduling & Dockerfiles
models/tide/src/tide/deploy.py, models/tide/tests/test_deploy.py, models/Dockerfile, models/tide/Dockerfile
Changed TiDE scheduled cron to weekdays and added timezone="America/New_York"; renamed Docker build stages (workerserver-worker, trainerrunner); tests adjusted accordingly.
Models: tests updates
models/tide/tests/test_trainer.py, models/tide/tests/test_workflow.py
Removed legacy test_trainer.py; expanded test_workflow.py to exercise new train_model/DEFAULT_CONFIGURATION behavior; updated tests to patch/train via workflow.train_model.
Docs / runbook & tooling
infrastructure/github_environment_runbook.md, .claude/commands/update-pull-request.md
Runbook removed manual Pulumi secret extraction and added PULUMI_ACCESS_TOKEN guidance; PR tooling moved many update-pull-request flows from GraphQL→REST, refactored scratchpad files and posting/mutation logic.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Developer
    participant GHA as "GitHub Actions"
    participant Matrix as "workflow matrix"
    participant Mask as "Mask / Flox"
    participant ECR as "AWS ECR"
    participant ECS as "AWS ECS"
    Developer->>GHA: push / workflow_dispatch
    GHA->>Matrix: expand application+stage matrix
    Matrix->>Mask: run build (package, stage)
    Mask->>ECR: check image by commit-hash
    alt image exists
        Mask-->>Matrix: skip build
    else image missing
        Mask->>Mask: build image (package-stage)
        Mask->>ECR: push image (package-stage)
    end
    GHA->>GHA: deploy_images job triggered (matrix)
    GHA->>ECS: deploy service using matrix.application/matrix.stage
    ECS->>ECS: wait for service stability
    GHA->>Developer: workflow complete
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the primary changes: Docker image handling reorganization, ECS service management updates, and model training resource consolidation across multiple files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docker-image-handling-updates
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

Migrating from UI to YAML configuration.

Use the @coderabbitai configuration command in a PR comment to get a dump of all your UI settings in YAML format. You can then edit this YAML file and upload it to the root of your repository to configure CodeRabbit programmatically.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 16, 2026

Greptile Summary

This PR consolidates Docker image handling for TiDE model training (server + worker now share a single ECR repo and Dockerfile), renames ECS services/task families/log groups to a consistent -server/-worker suffix convention, removes the automated scheduled launch/teardown triggers in favour of push-only and manual-only runs, and moves per-service ECS deployments into a dedicated deploy_images workflow job rather than the Pulumi stack-up script.

Key changes:

  • Single shared models/Dockerfile (renamed from tools/Dockerfile) replaces two separate ECR repos (fund/training-server, fund/training-worker) with one fund/model-trainer-server-worker repo.
  • New mask infrastructure image deploy command triggers aws ecs update-service --force-new-deployment and waits for stabilisation per service, replacing the inline bash loop that was inside the Pulumi stack-up script.
  • TiDE trainer.py deleted; train_model inlined into workflow.py with deferred tinygrad imports kept intact.
  • Critical: The server ECS task definition (training_server_task_definition) sets only command (Docker CMD), not entryPoint. Because models/Dockerfile declares an ENTRYPOINT pointing to the Prefect worker command, Docker will append the command value as arguments to the worker ENTRYPOINT rather than replacing it. The Prefect server will never start; the service will crash-loop and fail its ALB health check.
  • ECR lifecycle policy countNumber corrected from 0 to 1 day (a 0-day threshold expired untagged images immediately on push).

Confidence Score: 1/5

  • Not safe to merge — the Prefect server ECS service will crash-loop on every deployment until the missing entryPoint override is added.
  • The consolidation of server + worker into a single Docker image is architecturally sound, but the server's ECS container definition does not override the image's ENTRYPOINT (which is the Prefect worker command). ECS command maps to Docker CMD and is appended to the ENTRYPOINT, so the server container will run the worker process with garbage arguments and immediately error out. This breaks the model-trainer-server ECS service in production.
  • infrastructure/compute.py (server task definition around line 495) and models/Dockerfile (ENTRYPOINT declaration) need to be addressed together to fix the server startup issue.

Important Files Changed

Filename Overview
infrastructure/compute.py Renames all ECS services, task families, log groups, and service-discovery entries to the new "-server"/"-worker" convention; consolidates server+worker into one shared image URI — but the server task definition is missing an entryPoint override, causing the Prefect server to fail to start (see P0 comment).
infrastructure/storage.py Reverts S3 buckets from BucketV2 back to Bucket, fixes ECR repo names to use hyphens consistently, consolidates two separate training ECR repos into a single fund/model-trainer-server-worker repo, and fixes the lifecycle policy countNumber from 0 to 1 (preventing immediate expiry of untagged images).
models/Dockerfile Renamed from tools/Dockerfile; renames the final stage from worker to server-worker and copies models/tide/ into the build. The ENTRYPOINT remains the Prefect worker command, which conflicts with the server ECS task definition that only sets command (see P0 comment on compute.py).
.github/workflows/launch_infrastructure.yaml Removes scheduled launch trigger and adds a matrix-based deploy_images job that runs per-service ECS force-new-deployment after Pulumi completes; adds model-trainer as a build/push matrix entry and splits server/worker as separate deploy entries.
maskfile.md Renames imagesimage and servicesservice subcommands, adds a new deploy command that runs per-service ECS deployments, and fixes the build command to map hyphenated package names to underscore directory names when resolving Dockerfile paths.
models/tide/src/tide/workflow.py Absorbs the deleted trainer.py module by inlining train_model directly into workflow.py with deferred imports for tinygrad-heavy dependencies; updates the train_tide_model task to call the local function instead of the removed module import.
models/tide/src/tide/deploy.py Restricts the TiDE training schedule to weekdays only (0 22 * * 1-5) and adds timezone="America/New_York" so the cron fires at 10 PM ET rather than UTC.

Comments Outside Diff (1)

  1. infrastructure/compute.py, line 495-507 (link)

    Server starts worker process instead of Prefect server

    models/Dockerfile sets the image-level ENTRYPOINT to the worker command:

    ENTRYPOINT ["uv", "run", "--package", "tools", "prefect", "worker", "start", "--pool", "training-pool", "--type", "process"]
    

    The server container definition here only overrides command (Docker CMD), not entryPoint. In Docker/ECS, command is appended to the existing ENTRYPOINT rather than replacing it. The server container would therefore execute:

    uv run --package tools prefect worker start --pool training-pool --type process bash -c "..."
    

    prefect worker start would receive bash and -c "..." as extra positional arguments and error out. The Prefect server would never start, causing the ECS service to crash-loop and the ALB health check on port 4200 to fail.

    The fix is to add an entryPoint override that resets the Docker ENTRYPOINT so the command can run bash -c "..." directly:

    "entryPoint": ["bash", "-c"],
    "command": [
        (
            "export PREFECT_API_DATABASE_CONNECTION_URL="
            '$(python3 -c "'
            "import os, urllib.parse;"
            "p=urllib.parse.quote(os.environ['PREFECT_DB_PASSWORD'],safe='');"
            f"print(f'postgresql+asyncpg://prefect:{{p}}@{args[1]}/prefect')"
            '")'
            " && prefect server start --host 0.0.0.0"
        )
    ],

    Alternatively, change models/Dockerfile to use CMD instead of ENTRYPOINT for the default worker command so ECS command can fully replace it.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: infrastructure/compute.py
Line: 495-507

Comment:
**Server starts worker process instead of Prefect server**

`models/Dockerfile` sets the image-level `ENTRYPOINT` to the worker command:

```
ENTRYPOINT ["uv", "run", "--package", "tools", "prefect", "worker", "start", "--pool", "training-pool", "--type", "process"]
```

The server container definition here only overrides `command` (Docker CMD), not `entryPoint`. In Docker/ECS, `command` is **appended** to the existing `ENTRYPOINT` rather than replacing it. The server container would therefore execute:

```
uv run --package tools prefect worker start --pool training-pool --type process bash -c "..."
```

`prefect worker start` would receive `bash` and `-c "..."` as extra positional arguments and error out. The Prefect server would never start, causing the ECS service to crash-loop and the ALB health check on port 4200 to fail.

The fix is to add an `entryPoint` override that resets the Docker ENTRYPOINT so the `command` can run `bash -c "..."` directly:

```python
"entryPoint": ["bash", "-c"],
"command": [
    (
        "export PREFECT_API_DATABASE_CONNECTION_URL="
        '$(python3 -c "'
        "import os, urllib.parse;"
        "p=urllib.parse.quote(os.environ['PREFECT_DB_PASSWORD'],safe='');"
        f"print(f'postgresql+asyncpg://prefect:{{p}}@{args[1]}/prefect')"
        '")'
        " && prefect server start --host 0.0.0.0"
    )
],
```

Alternatively, change `models/Dockerfile` to use `CMD` instead of `ENTRYPOINT` for the default worker command so ECS `command` can fully replace it.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: "Add fixes after manu..."

Comment thread maskfile.md
Comment thread infrastructure/storage.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors model-training and infrastructure tooling to consolidate Docker/ECR handling, align ECS/Pulumi naming, and simplify Prefect-related workflows/resources.

Changes:

  • Moves TiDE train_model (+ configuration) into tide.workflow and relocates tests accordingly.
  • Renames/standardizes infrastructure resources (ECR repos, ECS services, service discovery names) around the new “model-trainer” and “*-server” naming.
  • Reworks maskfile.md and GitHub Actions to build/push/deploy images via unified commands and updated service naming.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
models/tide/tests/test_workflow.py Adds train_model tests and updates patch targets after moving training logic.
models/tide/tests/test_trainer.py Removes redundant trainer-module tests (now covered in workflow tests).
models/tide/tests/test_deploy.py Updates expected Prefect cron schedule string.
models/tide/src/tide/workflow.py Introduces DEFAULT_CONFIGURATION + train_model directly in workflow module; removes trainer import in flow.
models/tide/src/tide/trainer.py Deletes legacy trainer module (functionality moved).
models/tide/src/tide/deploy.py Updates Prefect deployment schedule and sets a timezone.
models/tide/Dockerfile Renames CUDA stage from trainer to runner.
models/Dockerfile Renames the final stage to server-worker.
maskfile.md Consolidates image commands, adds ECS deploy command, updates model deploy/train/download commands.
libraries/python/tests/test_infrastructure_storage.py Updates assertions for renamed ECR lifecycle resources.
infrastructure/storage.py Renames ECR repositories and consolidates model-trainer to a single repo.
infrastructure/github_environment_runbook.md Simplifies runbook instructions (bootstrap now sets secrets).
infrastructure/compute.py Renames ECS/ALB/SG/RDS/Redis resources and switches server/worker to shared model-trainer image URI.
infrastructure/main.py Updates Pulumi exports for renamed ECR resources and new Prefect API URL.
.github/workflows/teardown_infrastructure.yaml Removes scheduled teardown trigger.
.github/workflows/sync_data.yaml Removes repository-dispatch-based data sync workflow.
.github/workflows/launch_infrastructure.yaml Updates matrix to build/push/deploy per app-stage and removes schedule + sync trigger.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread maskfile.md Outdated
Comment thread maskfile.md
Comment thread maskfile.md
Comment thread maskfile.md Outdated
Comment thread maskfile.md
Comment thread .github/workflows/launch_infrastructure.yaml Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
infrastructure/compute.py (1)

363-380: ⚠️ Potential issue | 🔴 Critical

This renames the live Prefect database instead of migrating it.

Changing both the Pulumi logical resource name (from "prefect_database" to "model_trainer_state") and the RDS identifier (from "fund-prefect" to "fund-model-trainer-state") creates a new instance, not an in-place rename. The old database fund-prefect will be orphaned because Pulumi will attempt to delete it, but Pulumi's own deletion_protection=True flag on the new resource does not affect the old resource. However, depending on how Pulumi handles this drift, you risk losing the existing Prefect state or creating deployment failures. Please migrate using a snapshot/restore process or revert the logical name and identifier back to their original values to perform an in-place update.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infrastructure/compute.py` around lines 363 - 380, You renamed the Pulumi
logical resource and the RDS identifier (new resource model_trainer_state with
identifier "fund-model-trainer-state"), which causes Pulumi to create a new RDS
instance and orphan the existing "fund-prefect" DB; to fix, either revert the
Pulumi logical name and identifier back to the original values (restore the
resource name prefect_database and identifier "fund-prefect") so Pulumi will
perform an in-place update, or implement an explicit snapshot/restore migration:
take a snapshot of the existing instance, restore that snapshot into the desired
new instance identifier, import/replace the Pulumi resource state as needed, and
ensure deletion_protection/backups are configured so the original instance isn’t
inadvertently destroyed; update the code around model_trainer_state, identifier,
and any references to the original prefect_database/fund-prefect to match the
chosen approach.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@infrastructure/compute.py`:
- Around line 476-480: The server and worker task definitions both use the
shared variable model_trainer_server_worker_image_uri, causing image race
conditions; change the deployment to use distinct image identifiers (e.g.,
model_trainer_server_image_uri and model_trainer_worker_image_uri) or pin
immutable digests for each stage and replace
model_trainer_server_worker_image_uri in container_definitions and the other
occurrence (the similar container_definitions block around the second mention)
with the appropriate per-role image variable so server and worker always get the
correct image.

In `@maskfile.md`:
- Around line 213-214: The ECS waiter timeout is too short (Delay: 10,
MaxAttempts: 6 → 60s); update the aws ecs wait services-stable command to allow
up to ~5 minutes by increasing the waiter-config (e.g., keep "Delay": 10 and set
"MaxAttempts": 30 or set "Delay": 20 and "MaxAttempts": 15) so the ecs wait
services-stable --cluster "$cluster" --services "$service" uses the longer
timeout for stabilization.

In `@models/tide/src/tide/workflow.py`:
- Around line 41-45: The train_model signature currently types configuration as
plain dict | None; tighten it by declaring and using a more specific
configuration type (e.g., a TypedDict named ModelConfiguration or at minimum
dict[str, Any]) and update the parameter type to ModelConfiguration | None (or
dict[str, Any] | None) to match the shape seen in DEFAULT_CONFIGURATION; update
any imports (typing.TypedDict/Any) and adjust usages in train_model to use the
new type alias so type checkers can validate keys referenced inside the
function.

In `@models/tide/tests/test_deploy.py`:
- Around line 25-27: The test is missing an assertion for the new timezone
parameter: update the assertions in the test (where call_kwargs,
call_kwargs["parameters"], cron and LOOKBACK_DAYS are checked) to include assert
call_kwargs["parameters"]["timezone"] == "America/New_York" so the deployed
job's timezone is validated alongside cron, base_url, and lookback days.

In `@models/tide/tests/test_workflow.py`:
- Around line 171-179: The two tests test_train_model_returns_model_and_data and
test_train_model_uses_default_configuration call train_model() without
overriding epoch_count and thus depend on production defaults; update these
tests to either inject a cheap test configuration (e.g., pass epoch_count=1 or a
TestConfig object) into train_model or stub/mock the model.fit path so training
is cheap and deterministic—look for train_model, epoch_count and any internal
fit/training call to patch or parameterize, and make the same change for the
other affected tests (the ones around the 201-207 range) so all unit tests pin a
minimal training budget.

---

Outside diff comments:
In `@infrastructure/compute.py`:
- Around line 363-380: You renamed the Pulumi logical resource and the RDS
identifier (new resource model_trainer_state with identifier
"fund-model-trainer-state"), which causes Pulumi to create a new RDS instance
and orphan the existing "fund-prefect" DB; to fix, either revert the Pulumi
logical name and identifier back to the original values (restore the resource
name prefect_database and identifier "fund-prefect") so Pulumi will perform an
in-place update, or implement an explicit snapshot/restore migration: take a
snapshot of the existing instance, restore that snapshot into the desired new
instance identifier, import/replace the Pulumi resource state as needed, and
ensure deletion_protection/backups are configured so the original instance isn’t
inadvertently destroyed; update the code around model_trainer_state, identifier,
and any references to the original prefect_database/fund-prefect to match the
chosen approach.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8d146e9a-23e4-490e-93b7-9bce689ed45b

📥 Commits

Reviewing files that changed from the base of the PR and between 9418e89 and eecab19.

📒 Files selected for processing (17)
  • .github/workflows/launch_infrastructure.yaml
  • .github/workflows/sync_data.yaml
  • .github/workflows/teardown_infrastructure.yaml
  • infrastructure/__main__.py
  • infrastructure/compute.py
  • infrastructure/github_environment_runbook.md
  • infrastructure/storage.py
  • libraries/python/tests/test_infrastructure_storage.py
  • maskfile.md
  • models/Dockerfile
  • models/tide/Dockerfile
  • models/tide/src/tide/deploy.py
  • models/tide/src/tide/trainer.py
  • models/tide/src/tide/workflow.py
  • models/tide/tests/test_deploy.py
  • models/tide/tests/test_trainer.py
  • models/tide/tests/test_workflow.py
💤 Files with no reviewable changes (4)
  • .github/workflows/teardown_infrastructure.yaml
  • .github/workflows/sync_data.yaml
  • models/tide/src/tide/trainer.py
  • models/tide/tests/test_trainer.py

Comment thread infrastructure/compute.py
Comment thread maskfile.md Outdated
Comment thread models/tide/src/tide/workflow.py
Comment thread models/tide/tests/test_deploy.py
Comment thread models/tide/tests/test_workflow.py
Fix five issues identified in code review:

- maskfile build+push: use ${AWS_REGION:-} instead of ${AWS_REGION} so
  set -euo pipefail does not abort before the friendly error check when
  the variable is unset
- maskfile build+push: override repository_name to
  fund/model-trainer-server-worker when package_name is model-trainer,
  fixing mismatch between the single shared ECR repo in infrastructure
  and the fund/${package_name}-${stage_name} derivation that would
  produce non-existent repo names
- maskfile build: translate hyphens to underscores via tr when resolving
  the Dockerfile path for application services, fixing broken path for
  data-manager, portfolio-manager, and ensemble-manager whose directories
  use underscores
- CI workflow: change model-trainer matrix paths from the comma-separated
  string models/Dockerfile,models/__init__.py (treated as a single
  literal glob by dorny/paths-filter, never matching) to models/**
- workflow.py: tighten configuration type hint from dict | None to
  dict[str, Any] | None per project type hint guidelines
- test_deploy.py: add assertion for timezone parameter
- test_workflow.py: pass epoch_count=1 to two tests that previously ran
  the full 20-epoch production schedule, keeping unit tests cheap

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
maskfile.md (1)

220-221: ⚠️ Potential issue | 🟡 Minor

Increase ECS waiter timeout to avoid flaky deploy failures.

Line 220 and Line 221 currently allow only ~60 seconds, which is commonly insufficient for ECS service stabilization during rolling updates.

⏱️ Suggested timeout adjustment
 aws ecs wait services-stable --cluster "$cluster" --services "$service" \
-    --waiter-config '{"Delay": 10, "MaxAttempts": 6}'
+    --waiter-config '{"Delay": 15, "MaxAttempts": 20}'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maskfile.md` around lines 220 - 221, The ECS waiter timeout is too short for
service stabilization; update the aws ecs wait services-stable invocation (the
command using "$cluster" and "$service") to increase the waiter-config
Delay/MaxAttempts values so the total wait is much longer (e.g., Delay 15 and
MaxAttempts 40 for ~10 minutes) to avoid flaky deploy failures during rolling
updates; modify the waiter-config JSON accordingly to reflect the new Delay and
MaxAttempts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@models/tide/src/tide/workflow.py`:
- Around line 26-38: DEFAULT_CONFIGURATION is exported as a regular mutable dict
which can be accidentally mutated; replace it with an immutable mapping (e.g.,
wrap the literal with types.MappingProxyType or expose a
get_default_configuration() that returns a shallow copy) so callers cannot
change module-level defaults. Update the top-level symbol DEFAULT_CONFIGURATION
to be the immutable MappingProxyType of the existing dict (or replace usages to
call get_default_configuration()), import types.MappingProxyType if needed, and
ensure any tests or code that previously mutated DEFAULT_CONFIGURATION instead
obtain a mutable copy via dict(DEFAULT_CONFIGURATION) or the getter.

---

Duplicate comments:
In `@maskfile.md`:
- Around line 220-221: The ECS waiter timeout is too short for service
stabilization; update the aws ecs wait services-stable invocation (the command
using "$cluster" and "$service") to increase the waiter-config Delay/MaxAttempts
values so the total wait is much longer (e.g., Delay 15 and MaxAttempts 40 for
~10 minutes) to avoid flaky deploy failures during rolling updates; modify the
waiter-config JSON accordingly to reflect the new Delay and MaxAttempts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: aab16c18-6084-48fe-bf77-573c8021d82a

📥 Commits

Reviewing files that changed from the base of the PR and between eecab19 and a570618.

📒 Files selected for processing (5)
  • .github/workflows/launch_infrastructure.yaml
  • maskfile.md
  • models/tide/src/tide/workflow.py
  • models/tide/tests/test_deploy.py
  • models/tide/tests/test_workflow.py

Comment thread models/tide/src/tide/workflow.py
Copilot AI review requested due to automatic review settings March 16, 2026 02:16
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.claude/commands/update-pull-request.md (1)

32-74: ⚠️ Potential issue | 🟠 Major

Add strict error handling to all bash blocks to catch silent failures in jq transformations.

Blocks 1 and 3 execute external commands (gh api, jq) without error checking. Block 2 validates Block 1's output, but Block 3's jq transformations produce files with no downstream validation. If jq fails on malformed input or file I/O issues, the output files are empty or invalid, and downstream steps (lines 164+, 335–336) proceed silently with corrupted data.

Add set -euo pipefail at the start of each bash block (lines 32, 78, and 103) to exit immediately on command failure.

Suggested fix
   ```bash
+  set -euo pipefail
   # Set up scratchpad directory with session isolation
   : "${TMPDIR:=/tmp}"
   ```bash
+  set -euo pipefail
   SCRATCHPAD="<value printed above>"
   
   if ! command -v jq >/dev/null 2>&1; then
   ```bash
+  set -euo pipefail
   SCRATCHPAD="<value printed above>"
   
   echo "Extracting structured data from raw responses..."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/commands/update-pull-request.md around lines 32 - 74, The script
currently runs gh and jq invocations (e.g., gh api calls that write
pr_meta_raw.json, pr_comments_raw.json, pr_reviews_raw.json,
pr_review_comments_raw.json, check_runs_raw.json and the INLINE_QUERY GraphQL
call that writes thread_ids_raw.json) without strict failure handling; add set
-euo pipefail at the top of each bash block that performs external commands (the
scratchpad creation block that sets SCRATCHPAD and the blocks that call gh/jq
and that perform jq transformations) so the script exits on errors/undefined
variables and failed pipes, and also ensure any jq transformations that write
derived files validate input (e.g., check files exist and non-empty after jq
runs) before proceeding; update blocks around SCRATCHPAD creation, the gh api
fetches, and the jq processing/INLINE_QUERY usage accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/commands/update-pull-request.md:
- Around line 307-330: The script currently uses hardcoded placeholder thread
IDs in the threads list and calls urllib.request.urlopen without a timeout or
error checking; update it to read real thread IDs from the scratchpad file
instead of the hardcoded "threads" list, add a timeout to the network call (use
the timeout parameter on urllib.request.urlopen), and validate the GraphQL
response by checking for an "errors" key and HTTP error conditions before using
data; also wrap the token retrieval, network call (urllib.request.urlopen) and
JSON parsing in try/except to log/print failures and continue processing other
thread IDs so one error doesn’t hang or abort the whole script.
- Around line 56-60: The REST gh api calls that save PR data (the commands
writing to "${SCRATCHPAD}/pr_meta_raw.json",
"${SCRATCHPAD}/pr_comments_raw.json", "${SCRATCHPAD}/pr_reviews_raw.json", and
"${SCRATCHPAD}/pr_review_comments_raw.json") only fetch the first page; add the
--paginate flag to each of those gh api invocations so all pages are retrieved.
For the GraphQL query that extracts review threads (the block capped at 100),
implement cursor-based pagination (use endCursor/hasNextPage and loop until
hasNextPage is false) so all threads are fetched, and when using jq to write
combined outputs, add --slurp to produce a single JSON array to avoid fragmented
output. Ensure the same --paginate and pagination approach is applied to the
other REST/GraphQL review/comments fetches referenced in the file (including the
review-threads section).

---

Outside diff comments:
In @.claude/commands/update-pull-request.md:
- Around line 32-74: The script currently runs gh and jq invocations (e.g., gh
api calls that write pr_meta_raw.json, pr_comments_raw.json,
pr_reviews_raw.json, pr_review_comments_raw.json, check_runs_raw.json and the
INLINE_QUERY GraphQL call that writes thread_ids_raw.json) without strict
failure handling; add set -euo pipefail at the top of each bash block that
performs external commands (the scratchpad creation block that sets SCRATCHPAD
and the blocks that call gh/jq and that perform jq transformations) so the
script exits on errors/undefined variables and failed pipes, and also ensure any
jq transformations that write derived files validate input (e.g., check files
exist and non-empty after jq runs) before proceeding; update blocks around
SCRATCHPAD creation, the gh api fetches, and the jq processing/INLINE_QUERY
usage accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 30d71002-9f0d-4c39-ac89-07d5835a35b5

📥 Commits

Reviewing files that changed from the base of the PR and between a570618 and b5a9f30.

📒 Files selected for processing (1)
  • .claude/commands/update-pull-request.md

Comment thread .claude/commands/update-pull-request.md Outdated
Comment thread .claude/commands/update-pull-request.md
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR consolidates model-training code and modernizes infrastructure/image handling by standardizing service naming, ECR repositories, and ECS deployment workflows (aligned with the newer Prefect pattern).

Changes:

  • Moved TiDE train_model (and its default configuration) into tide.workflow, updating tests accordingly and removing tide.trainer.
  • Standardized ECS/ECR naming (notably “*-server” suffixes) and consolidated model-trainer images into a single ECR repository.
  • Updated automation: adjusted Prefect deployment schedule (weekdays + timezone), removed scheduled infra teardown/data sync workflows, and added a CLI deploy command for ECS services.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
models/tide/src/tide/workflow.py Adds DEFAULT_CONFIGURATION + train_model to workflow module; removes deferred import from old trainer.
models/tide/src/tide/trainer.py Removes the old training module (now consolidated into workflow).
models/tide/tests/test_workflow.py Updates patching target and adds unit tests for train_model now in tide.workflow.
models/tide/tests/test_trainer.py Removes trainer-focused tests (moved into workflow tests).
models/tide/src/tide/deploy.py Updates Prefect deployment schedule (weekday cron + timezone).
models/tide/tests/test_deploy.py Aligns tests with updated cron + timezone.
maskfile.md Reworks image build/push semantics around package_name, adds image deploy, and updates model commands.
.github/workflows/launch_infrastructure.yaml Updates CI to build/push and now attempt ECS deploy per matrix entry.
.github/workflows/teardown_infrastructure.yaml Removes scheduled teardown (manual only).
.github/workflows/sync_data.yaml Removes scheduled data-sync workflow.
infrastructure/storage.py Renames ECR repos/lifecycle policies; consolidates model-trainer repo naming.
infrastructure/compute.py Renames ECS resources, service discovery names, log groups, and Prefect state/broker resources.
infrastructure/__main__.py Updates Pulumi exports to match new repo/service names and URLs.
libraries/python/tests/test_infrastructure_storage.py Updates assertions for renamed lifecycle policy resources.
models/Dockerfile Renames build stage to server-worker.
models/tide/Dockerfile Renames CUDA stage from trainer to runner.
infrastructure/github_environment_runbook.md Documents automated population of env secrets via bootstrap.
.claude/commands/update-pull-request.md Updates operational guidance to prefer REST + Python over MCP GitHub tools.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread models/tide/src/tide/workflow.py
Comment thread .github/workflows/launch_infrastructure.yaml Outdated
Comment thread maskfile.md Outdated
Comment thread infrastructure/compute.py
…resolve script, Pulumi stack selection

- Add --paginate flag to REST array endpoint fetches in update-pull-request.md,
  piping through jq -s 'add' to flatten multi-page responses into a single array
- Replace hardcoded placeholder thread IDs in Python resolve script with
  automatic loading from thread_ids.json in scratchpad; add timeout=30 and
  GraphQL error checking
- Add pulumi org get-default + pulumi stack select pattern to mask infrastructure
  image deploy command, matching the existing convention used in other mask commands
- Add PULUMI_ACCESS_TOKEN env var to the deploy step in launch_infrastructure.yaml
  so Pulumi is authenticated when the deploy command runs in CI

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/launch_infrastructure.yaml:
- Around line 74-83: The pre-stack ECS service deploy step (the step named
"Deploy ${{ matrix.application }}-${{ matrix.stage }}" that runs "mask
infrastructure image deploy") can fail if services or infra are renamed; reorder
the workflow so the Pulumi stack operation ("mask infrastructure stack up" run
in the launch_infrastructure job) happens before any per-service deploys, i.e.
make the "launch_infrastructure" job run immediately after build/push and have
the matrix/service deploy steps depend on its completion (or move "mask
infrastructure image deploy" into a job that lists launch_infrastructure as a
prerequisite), ensuring "mask infrastructure image deploy" only runs after "mask
infrastructure stack up".
- Around line 4-7: The workflow currently triggers on every push to master;
narrow the trigger by adding on.push.paths to only infra/service-relevant
patterns (e.g. infrastructure/**, pulumi/**, deploy/**, charts/**,
scripts/pulumi/**) and/or on.push.paths-ignore for docs/metadata (e.g. **/*.md)
so the Pulumi job only runs for relevant changes; update the YAML's on.push
block (keeping branches: [master]) to include these path patterns to reduce
unnecessary full infra runs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 638dec68-1539-4040-8435-6934342165a8

📥 Commits

Reviewing files that changed from the base of the PR and between b5a9f30 and 74b9232.

📒 Files selected for processing (3)
  • .claude/commands/update-pull-request.md
  • .github/workflows/launch_infrastructure.yaml
  • maskfile.md

Comment thread .github/workflows/launch_infrastructure.yaml
Comment thread .github/workflows/launch_infrastructure.yaml Outdated
Copilot AI review requested due to automatic review settings March 16, 2026 02:57
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/launch_infrastructure.yaml:
- Around line 98-154: The deploy_images job duplicates the matrix and path
filter definitions; consolidate by extracting the matrix and path-filter config
into YAML anchors (e.g., define a top-level anchor like &matrix-config for the
matrix entries and &service-paths for the paths list) and then reference them in
deploy_images (strategy.matrix: <<: *matrix-config) and in the Check for service
changes step (use the anchored list for the filters input), ensuring job name
deploy_images and the paths-filter step id changes still consume the anchored
values; alternatively create a reusable workflow that accepts the matrix as an
input and have both deploy_images and build_and_push_images call it.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ebca9af7-8cb3-46d3-bf49-c9db9995a547

📥 Commits

Reviewing files that changed from the base of the PR and between 74b9232 and edfb386.

📒 Files selected for processing (1)
  • .github/workflows/launch_infrastructure.yaml

Comment thread .github/workflows/launch_infrastructure.yaml
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR standardizes Docker/ECR/ECS naming and deployment mechanics across the repo, consolidates TiDE training code into the workflow module, and updates CI/runbooks to match the new infrastructure + Prefect patterns.

Changes:

  • Consolidates TiDE model training logic into tide.workflow and migrates unit tests accordingly.
  • Standardizes ECR repo names / ECS service names (notably adding consistent -server suffixes and consolidating model-trainer images).
  • Reworks mask infra commands and GitHub Actions workflow steps for build/push/deploy, while removing scheduled data-sync and infra scheduling.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
models/tide/tests/test_workflow.py Adds coverage for train_model now located in tide.workflow; fixes patch target.
models/tide/tests/test_trainer.py Removes trainer-focused tests after moving training logic.
models/tide/tests/test_deploy.py Updates expected Prefect deployment schedule (weekdays + timezone).
models/tide/src/tide/workflow.py Introduces DEFAULT_CONFIGURATION + train_model and removes tide.trainer dependency from the flow.
models/tide/src/tide/trainer.py Removes the standalone trainer module (training now in workflow).
models/tide/src/tide/deploy.py Deploys Prefect schedule with weekday cron and explicit timezone.
models/tide/Dockerfile Renames build stage to runner to match “tide runner” semantics.
models/Dockerfile Renames stage to server-worker for combined model-trainer image.
maskfile.md Consolidates image build/push logic, adds ECS deploy command, updates model commands and service naming.
libraries/python/tests/test_infrastructure_storage.py Updates assertions for new lifecycle policy resource names.
infrastructure/storage.py Renames ECR repos/lifecycle policies; consolidates model-trainer to a single repo.
infrastructure/github_environment_runbook.md Updates runbook to reflect bootstrap-driven secret population.
infrastructure/compute.py Renames ECS services, SD names, log groups, and Prefect state/broker resources to new naming scheme.
infrastructure/main.py Updates Pulumi exports for renamed repos/images and new service discovery hostname.
.github/workflows/teardown_infrastructure.yaml Removes scheduled teardown (now manual only).
.github/workflows/sync_data.yaml Removes the sync-data workflow.
.github/workflows/launch_infrastructure.yaml Updates build/push/deploy flow and matrix naming; removes schedule/manual triggers.
.claude/commands/update-pull-request.md Updates automation guidance to use REST + Python for thread resolution reliability.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/launch_infrastructure.yaml Outdated
Comment thread .github/workflows/launch_infrastructure.yaml
coderabbitai[bot]
coderabbitai Bot previously approved these changes Mar 16, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR cleans up build/deploy and infrastructure naming, consolidates TiDE model-training code into the workflow module, and updates CI/CD + tooling to match the new ECS/model-trainer layout and Prefect scheduling pattern.

Changes:

  • Consolidate TiDE training logic/tests by moving train_model + defaults into tide.workflow and updating tests accordingly.
  • Rename/standardize infrastructure resources (ECR repos, ECS services, service discovery names) and update workflows/tooling to build/push/deploy by package_name + stage_name.
  • Adjust operational automation: remove scheduled sync/teardown workflows and change training deployment to weekday-only with an explicit timezone.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
models/tide/tests/test_workflow.py Adds train_model unit tests and updates patching/imports after trainer consolidation.
models/tide/tests/test_trainer.py Removes trainer-specific tests (moved into workflow tests).
models/tide/tests/test_deploy.py Updates expectations for weekday cron + timezone in Prefect deployment.
models/tide/src/tide/workflow.py Adds DEFAULT_CONFIGURATION + train_model implementation; removes import dependency on removed trainer module.
models/tide/src/tide/trainer.py Removes standalone trainer module (logic moved into workflow).
models/tide/src/tide/deploy.py Updates Prefect deployment schedule (weekday-only) and sets timezone.
models/tide/Dockerfile Renames the CUDA stage target from trainer to runner.
models/Dockerfile Renames stage target from worker to server-worker for the consolidated model-trainer image.
maskfile.md Renames/expands image commands (build/push/deploy), standardizes naming, and updates Pulumi import hints.
libraries/python/tests/test_infrastructure_storage.py Updates infra snapshot expectations for renamed ECR lifecycle resources.
infrastructure/storage.py Changes S3 bucket resource types/names and standardizes ECR repository/lifecycle resources.
infrastructure/github_environment_runbook.md Updates runbook to reflect automated secret population via mask ... --bootstrap.
infrastructure/compute.py Renames ECS services/log groups/service discovery and refactors Prefect backing resources to “model-trainer-*”.
infrastructure/__main__.py Updates exported outputs for renamed ECR resources and training API URL.
.github/workflows/teardown_infrastructure.yaml Removes scheduled teardown; keeps manual-only runs.
.github/workflows/sync_data.yaml Removes automated data sync workflow.
.github/workflows/launch_infrastructure.yaml Reworks to matrix build/push + separate deploy step; removes scheduled launch.
.claude/commands/update-pull-request.md Updates internal PR-update instructions to prefer REST + gh api and Python for GraphQL mutations.
Comments suppressed due to low confidence (2)

infrastructure/compute.py:383

  • Renaming the RDS instance resource and changing identifier will force a brand-new database to be created (the old one cannot be renamed in-place), which risks downtime/data loss for Prefect unless there’s an explicit migration plan. If the intent is a logical rename only, keep the existing identifier and add ResourceOptions(aliases=[...]) to adopt the prior Pulumi resource; if a new DB is intended, document the migration/backup steps and how the new secret ARN will be wired into consumers.
# RDS PostgreSQL for Prefect database
model_trainer_state = aws.rds.Instance(
    "model_trainer_state",
    identifier="fund-model-trainer-state",
    engine="postgres",
    engine_version="14",
    instance_class="db.t3.micro",
    allocated_storage=20,
    db_name="prefect",
    username="prefect",
    manage_master_user_password=True,
    db_subnet_group_name=prefect_rds_subnet_group.name,
    vpc_security_group_ids=[prefect_rds_security_group.id],
    skip_final_snapshot=False,
    final_snapshot_identifier=f"fund-model-trainer-state-final-{pulumi.get_stack()}",
    backup_retention_period=7,
    storage_encrypted=True,
    deletion_protection=True,
    tags=tags,
)

infrastructure/compute.py:426

  • Changing the ElastiCache cluster_id (and Pulumi logical name) will replace the Redis cluster. If existing environments depend on the current broker state/endpoint, add aliases and keep the existing cluster_id, or document the migration and ensure all clients (Prefect server/worker) are updated to the new endpoint as part of the same rollout.
# ElastiCache Redis for Prefect messaging
model_trainer_broker = aws.elasticache.Cluster(
    "model_trainer_broker",
    cluster_id="fund-model-trainer-broker",
    engine="redis",
    engine_version="7.0",
    node_type="cache.t3.micro",
    num_cache_nodes=1,
    subnet_group_name=prefect_elasticache_subnet_group.name,
    security_group_ids=[prefect_redis_security_group.id],
    tags=tags,
)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread infrastructure/storage.py
Comment thread infrastructure/storage.py
Comment thread infrastructure/storage.py
@forstmeier forstmeier added the rust Rust code updates label Mar 18, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes Mar 18, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR consolidates Docker image handling and deployment workflows, renames/standardizes ECS + ECR/Pulumi resources (notably the model-trainer services), and moves TiDE training logic into the workflow module while adjusting Prefect deployment scheduling.

Changes:

  • Reworked Mask + GitHub Actions workflows for build/push/deploy, removed scheduled sync/teardown automation, and added an explicit ECS deploy command.
  • Renamed and consolidated AWS/Pulumi resources (ECR repos, ECS services, service discovery, logs, Prefect DB/Redis) to consistent *-server / model-trainer-* naming.
  • Moved TiDE train_model + default config into tide.workflow, updated tests, and adjusted training deployment schedule to weekdays with an explicit timezone.

Reviewed changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
uv.lock Adds pytz to the lockfile dependency set.
models/tide/tests/test_workflow.py Updates workflow tests and adds coverage for the relocated train_model.
models/tide/tests/test_trainer.py Removes trainer tests now that training logic moved out of tide.trainer.
models/tide/tests/test_deploy.py Updates expected Prefect cron schedule + timezone.
models/tide/src/tide/workflow.py Introduces DEFAULT_CONFIGURATION + train_model and removes import from the deleted trainer module.
models/tide/src/tide/trainer.py Deletes the old trainer module (training logic relocated).
models/tide/src/tide/deploy.py Deploys Prefect flow on weekdays and sets timezone.
models/tide/Dockerfile Renames the CUDA stage to runner (aligns with “tide runner” usage).
models/Dockerfile Renames the runtime stage to server-worker for the consolidated model-trainer image target.
maskfile.md Consolidates image commands under infrastructure image, adds deploy, updates Pulumi imports, and adds model command routing.
libraries/python/tests/test_infrastructure_storage.py Updates infra snapshot expectations for new ECR repo lifecycle resource names.
infrastructure/storage.py Updates S3 bucket resources and standardizes ECR repository naming + lifecycle resources.
infrastructure/notifications.py Documents Pulumi config variable controlling monthly budget.
infrastructure/github_environment_runbook.md Updates runbook to reflect automated env-secret population via mask ... --bootstrap.
infrastructure/compute.py Renames ECS services/SD/log groups and consolidates server+worker to a single model-trainer image URI; tweaks ALB config.
infrastructure/__main__.py Updates Pulumi stack outputs to reflect renamed repos/images and training API hostname.
infrastructure/Pulumi.production.yaml Raises monthlyBudgetLimitUsd config value.
applications/portfolio_manager/pyproject.toml Adds pytz dependency.
applications/ensemble_manager/Dockerfile Copies tools/ into the image (supports shared tooling usage).
applications/data_manager/Dockerfile Sets CARGO_BUILD_JOBS=4 during the build stage.
.github/workflows/teardown_infrastructure.yaml Removes scheduled teardown; keeps manual workflow dispatch.
.github/workflows/sync_data.yaml Removes the sync-data workflow.
.github/workflows/launch_infrastructure.yaml Switches to a matrix build/push + deploy workflow; removes schedule trigger.
.claude/commands/update-pull-request.md Updates internal PR-update automation guidance to prefer REST + Python for thread resolution.
Comments suppressed due to low confidence (1)

models/Dockerfile:24

  • The server-worker stage is now the canonical build target for the model-trainer image, but it hardcodes an ENTRYPOINT that starts a Prefect worker. Since the ECS model-trainer server task definition uses this same image and tries to override behavior via command, the server container may not start correctly (ECS command won’t replace ENTRYPOINT). Consider using CMD (so ECS can override), or provide separate server and worker targets/images.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread infrastructure/storage.py
Comment thread infrastructure/compute.py
Comment thread infrastructure/compute.py
Comment thread infrastructure/storage.py
Comment thread infrastructure/compute.py
Comment thread infrastructure/Pulumi.production.yaml
Comment thread applications/portfolio_manager/pyproject.toml
Comment thread infrastructure/storage.py
@chrisaddy chrisaddy merged commit 1daa616 into master Mar 18, 2026
12 checks passed
@chrisaddy chrisaddy deleted the docker-image-handling-updates branch March 18, 2026 18:59
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Overview Mar 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

markdown Markdown code updates python Python code updates rust Rust code updates yaml YAML code updates

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants